INTERSPEECH 2008 - Speech Processing

Total: 78

#1 Low complexity near-optimal unit-selection algorithm for ultra low bit-rate speech coding based on n-best lattice and Viterbi search

Authors: V. Ramasubramanian ; D. Harish

We propose a low-complexity unit-selection algorithm for ultra low bit-rate speech coding based on a first-stage N-best prequantization lattice and a second-stage run-length constrained Viterbi search, which efficiently approximates the complete search space of the fully-optimal unit-selection algorithm we recently proposed. As a result, the proposed low-complexity algorithm remains near-optimal in terms of rate-distortion performance while having greatly reduced complexity.

#2 A new fast algebraic fixed codebook search algorithm in CELP speech coding

Authors: Vaclav Eksler ; Redwan Salami ; Milan Jelinek

This paper introduces a very fast algebraic fixed-codebook search algorithm for CELP-based speech codecs. The proposed method searches codebook pulses sequentially and recomputes the fixed codebook gain, the so-called backward filtered target vector, and a certain reference signal after each new pulse is determined. This results in a significant complexity reduction compared to existing methods while preserving the same speech quality. The presented algorithm is used in the new embedded speech and audio codec (G.EV-VBR) currently being standardized by the ITU-T.

#3 A novel transcoding algorithm between 3GPP AMR-NB (7.95 kbit/s) and ITU-T G.729A (8 kbit/s)

Authors: Hao Xu ; Changchun Bao

In this paper, a novel transcoding algorithm, specifically addressing codebook gain conversion, is proposed between AMR-NB at 7.95 kbit/s and G.729A. It bypasses the gain prediction process and directly converts the codebook gain parameters. Additionally, the new gain parameter conversion method can be extended to the other rate modes of AMR-NB when transcoding with G.729A. The experimental results show that the quality of transcoded speech using the proposed algorithm is greatly improved and the computational complexity is reduced by 85% compared with the DTE (decode-then-encode) method. A 5 ms look-ahead delay is avoided as well.

#4 Mel-frequency cepstral coefficient-based bandwidth extension of narrowband speech

Authors: Amr H. Nour-Eldin ; Peter Kabal

We present a novel MFCC-based scheme for the Bandwidth Extension (BWE) of narrowband speech. BWE is based on the assumption that narrowband speech (0.3-3.4 kHz) correlates closely with the highband signal (3.4-7 kHz), enabling estimation of the highband frequency content given the narrow band. While BWE schemes have traditionally used LP-based parametrizations, our recent work has shown that MFCC parametrization results in higher correlation between both bands, reaching twice that obtained using LSFs. By employing a high-resolution IDCT of highband MFCCs, obtained from narrowband MFCCs by statistical estimation, we achieve high-quality highband power spectra from which the time-domain speech signal can be reconstructed. Implementing this scheme for BWE translates the higher correlation advantage of MFCCs into BWE performance superior to that obtained using LSFs, as shown by improvements in log-spectral distortion as well as Itakura-based measures (the latter improving by up to 13%).
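
The core reconstruction step lends itself to a short illustration. Below is a minimal numpy/scipy sketch of recovering a smooth highband power envelope from an MFCC vector via a high-resolution inverse DCT; the coefficient count, bin count, and function name are illustrative choices of ours, not the authors' exact configuration.

```python
import numpy as np
from scipy.fftpack import idct

def mfcc_to_power_envelope(mfcc, n_bins=64):
    """Invert truncated-DCT MFCCs into a smooth (mel-warped) power envelope.

    Zero-padding the cepstrum and inverting at n_bins >> len(mfcc) realizes
    the 'high-resolution IDCT' idea: the result is an interpolated log
    spectrum (up to a scale factor from the change in transform length).
    """
    cep = np.zeros(n_bins)
    cep[:len(mfcc)] = mfcc
    log_mel = idct(cep, type=2, norm='ortho')  # inverse of an ortho DCT-II
    return np.exp(log_mel)                     # back from log to power domain

# Toy usage with a made-up 13-coefficient highband MFCC vector.
envelope = mfcc_to_power_envelope(np.random.randn(13), n_bins=64)
```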

#5 A PCM coding noise reduction for ITU-T G.711.1

Authors: Jean-Luc Garcia ; Claude Marro ; Balazs Kövesi

The ITU-T G.711.1 embedded wideband speech codec was approved by the ITU-T in March 2008. This codec generates a bitstream comprised of three layers: a G.711-compatible core layer with noise shaping, a lower-band enhancement layer, and an MDCT-based higher-band enhancement layer. It also contains an optional postprocessing module, called Appendix I, designed to improve the quality of the decoded speech when interoperating with a legacy G.711 encoder. The improvement is achieved by a novel low-complexity PCM quantization noise reduction technique described in this article. Subjective test results show that the quality of the interoperability mode with the legacy G.711 codec is significantly better when Appendix I is activated.

#6 An instrumental measure for end-to-end speech transmission quality based on perceptual dimensions: framework and realization

Authors: Marcel Wältermann ; Kirstin Scholz ; Sebastian Möller ; Lu Huo ; Alexander Raake ; Ulrich Heute

In this contribution, a new instrumental measure for end-to-end speech transmission quality is presented, based on perceptually relevant dimensions. The paper describes the complete scientific development process of such a measure, starting from the general framework and concluding with the concrete realization. The measure is based on the dimensions "discontinuity", "noisiness", and "coloration", which were identified through multidimensional analyses. Three dimension estimators are introduced which are capable of predicting so-called dimension impairment factors on the basis of signal parameters. Each dimension impairment factor reflects the degradation with respect to a single perceptual dimension. By combining the impairment factors, integral quality can be estimated. A maximum correlation of r = 0.9 with auditory test results is achieved for a wide range of perceptually different conditions.

#7 Enhancement of noisy speech recordings via blind source separation

Authors: Jiri Malek ; Zbynek Koldovsky ; Jindrich Zdansky ; Jan Nouza

We propose an improved time-domain Blind Source Separation method and apply it to speech signal enhancement using multiple microphone recordings. The improvement consists in using fuzzy clustering instead of hard clustering, which is verified by experiments in which real-world mixtures of two audio signals are separated from two microphones. The performance of the method is demonstrated by recognizing mixed and separated utterances from the Czech part of the European broadcast news database using our Czech LVCSR system. The separation allows significantly better recognition, e.g., by 32% when the jammer signal is Gaussian noise and the input signal-to-noise ratio is 10 dB.
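
The abstract does not spell out the clustering rule, but the standard fuzzy c-means membership update is one natural way to replace hard assignment with soft weights. A minimal sketch under that assumption (the fuzzifier value and function name are ours):

```python
import numpy as np

def fuzzy_memberships(X, centers, m=2.0):
    """Fuzzy c-means membership weights, the soft analogue of hard assignment.

    X       : (n_samples, n_features) data points
    centers : (n_clusters, n_features) cluster centres
    m       : fuzzifier; m -> 1 approaches hard clustering
    Returns (n_samples, n_clusters) weights that sum to 1 per sample.
    """
    # squared distance of every sample to every centre
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    d2 = np.maximum(d2, 1e-12)                 # avoid division by zero
    inv = d2 ** (-1.0 / (m - 1.0))             # standard FCM membership rule
    return inv / inv.sum(axis=1, keepdims=True)
```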

#8 Studies on estimation of the number of sources in blind source separation

Authors: Takaaki Ishibashi ; Hidetoshi Nakashima ; Hiromu Gotanda

ICA (Independent Component Analysis) can estimate unknown source signals from their mixtures under the assumption that the source signals are statistically independent. However, in a real environment, the separation performance often deteriorates because the number of source signals differs from the number of sensors. In this paper, we propose a method for estimating the number of sources based on the joint distribution of the observed signals in a two-sensor configuration. Several simulation results show that the number of sources coincides with the number of peaks in the histogram of that distribution. The proposed method can estimate the number of sources even if it is larger than the number of observed signals.
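
The abstract leaves the concrete statistic open; one common two-sensor realization is to histogram the sample directions in the (x1, x2) scatter and count its peaks, since each sufficiently sparse source contributes one dominant direction. A hedged numpy sketch, with illustrative bin counts and peak thresholds of our own choosing:

```python
import numpy as np

def estimate_num_sources(x1, x2, bins=180, rel_height=0.2):
    """Guess the source count from the joint distribution of two sensors.

    Each source contributes a dominant direction in the (x1, x2) scatter;
    we histogram those directions and count peaks. Thresholds and bin
    counts are illustrative, not the authors' settings.
    """
    theta = np.arctan2(x2, x1) % np.pi          # direction of each sample
    hist, _ = np.histogram(theta, bins=bins, range=(0.0, np.pi))
    thresh = rel_height * hist.max()
    # a peak: strictly larger than both neighbours and above the threshold
    mid = hist[1:-1]
    peaks = (mid > hist[:-2]) & (mid > hist[2:]) & (mid > thresh)
    return int(peaks.sum())
```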

#9 Speech enhancement based on hypothesized Wiener filtering

Authors: V. Ramasubramanian ; Deepak Vijaywargi

We propose a novel speech enhancement technique based on the hypothesized Wiener filter (HWF) methodology. The proposed HWF algorithm selects a filter for enhancing the input noisy signal by first 'hypothesizing' a set of filters and then choosing the most appropriate one for the actual filtering. We show that the proposed HWF can intrinsically offer superior performance to conventional Wiener filtering (CWF) algorithms, which typically select a filter based only on the noisy input signal, resulting in a sub-optimal choice. We present results showing the advantages of HWF-based speech enhancement over CWF, particularly with respect to the baseline performances achievable by HWF and with respect to the type of clean frames used, namely codebooks versus a large number of clean frames. HWF-based speech enhancement consistently outperforms CWF in terms of spectral distortion at various input SNR levels.

#10 Psychoacoustically-motivated adaptive β-order generalized spectral subtraction based on data-driven optimization

Authors: Junfeng Li ; Hui Jiang ; Masato Akagi

To mitigate the performance limitations caused by the constant spectral order β in traditional spectral subtraction methods, we previously presented an adaptive β-order generalized spectral subtraction (GSS) in which the spectral order β is updated in a heuristic way. In this paper, we propose a psychoacoustically-motivated adaptive β-order GSS that takes into account the fact that different frequency bands contribute different amounts to speech intelligibility (i.e., the band-importance function). Specifically, in the proposed adaptive β-order GSS, the tendency of the spectral order β to change with the local input signal-to-noise ratio (SNR) is quantitatively approximated by a sigmoid function, derived through a data-driven optimization procedure that minimizes the intelligibility-weighted distance between the desired speech spectrum and its estimate. The inherent parameters of the sigmoid function are further optimized with the same data-driven procedure. Experimental results indicate that the proposed psychoacoustically-motivated adaptive β-order GSS yields great improvements over traditional spectral subtraction methods in terms of intelligibility-weighted measures.
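
A minimal sketch of the subtraction rule this describes: β is obtained from the local SNR through a sigmoid and then used as the spectral order of generalized spectral subtraction. The sigmoid slope, midpoint, and β range below are placeholders, not the data-driven optima reported by the authors:

```python
import numpy as np

def beta_order_gss(noisy_mag, noise_mag, alpha=1.0, floor=0.01,
                   beta_min=0.5, beta_max=2.0, k=0.5, snr0=0.0):
    """Adaptive beta-order generalized spectral subtraction (sketch).

    noisy_mag, noise_mag : magnitude spectra of the noisy frame and the
                           noise estimate. The sigmoid mapping from local
                           SNR to beta follows the abstract's idea; all
                           parameter values here are illustrative.
    """
    snr_db = 10.0 * np.log10(np.maximum(noisy_mag, 1e-12) ** 2 /
                             np.maximum(noise_mag, 1e-12) ** 2)
    beta = beta_min + (beta_max - beta_min) / (1.0 + np.exp(k * (snr_db - snr0)))
    diff = noisy_mag ** beta - alpha * noise_mag ** beta
    diff = np.maximum(diff, (floor * noise_mag) ** beta)  # spectral floor
    return diff ** (1.0 / beta)
```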

#11 Two stage iterative Wiener filtering for speech enhancement

Authors: Krishna Nand K ; T. V. Sreenivas

We formulate a two-stage iterative Wiener filtering (IWF) approach to speech enhancement that betters the performance of the constrained IWF reported in the literature. The codebook-constrained IWF (CCIWF) has been shown to be effective in achieving convergence of IWF in the presence of both stationary and non-stationary noise. To this, we add a second stage of unconstrained IWF and show that speech enhancement performance can be improved in terms of average segmental SNR (SSNR), Itakura-Saito (IS) distance, and Linear Prediction Coefficients (LPC) parameter coincidence. We also explore the tradeoff between the number of CCIWF iterations and the number of second-stage IWF iterations.
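
For reference, the unconstrained second stage can be sketched as a classic frequency-domain IWF loop: re-estimate an all-pole speech PSD from the current clean estimate, rebuild the Wiener gain, and refilter the noisy frame. The iteration count, LPC order, and regularization below are illustrative:

```python
import numpy as np

def iterative_wiener(frame, noise_psd, n_iter=3, lpc_order=10):
    """Unconstrained iterative Wiener filtering of one frame (sketch).

    frame     : noisy time-domain frame
    noise_psd : noise power spectrum, length len(frame)//2 + 1
    """
    est = frame.copy()
    X = np.fft.rfft(frame)
    for _ in range(n_iter):
        # AR model of the current estimate via the Yule-Walker equations
        r = np.correlate(est, est, 'full')[len(est) - 1:][:lpc_order + 1]
        R = np.array([[r[abs(i - j)] for j in range(lpc_order)]
                      for i in range(lpc_order)])
        a = np.linalg.solve(R + 1e-6 * np.eye(lpc_order), r[1:])
        g = max(r[0] - a @ r[1:], 1e-12)            # prediction-error power
        A = np.fft.rfft(np.r_[1.0, -a], len(frame))
        speech_psd = g / np.maximum(np.abs(A) ** 2, 1e-12)
        H = speech_psd / (speech_psd + noise_psd)   # Wiener gain
        est = np.fft.irfft(H * X, len(frame))       # refilter the noisy frame
    return est
```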

#12 Assessment of correlation between objective measures and speech recognition performance in the evaluation of speech enhancement

Authors: Pei Ding ; Jie Hao

Speech enhancement is widely used to improve the perceptual quality of noisy speech by suppressing interfering ambient noise and is commonly evaluated via objective quality measures. Automatic speech recognition (ASR) systems also use such speech enhancement technologies in the front end to improve their noise robustness. If the objective measures have a high correlation with speech recognition accuracy, we can effectively predict ASR performance from objective quality measures in advance and flexibly optimize the enhancement algorithms at the system design stage. Motivated by this idea, this paper investigates the correlation between ASR performance and several traditional objective measures on the Aurora2 database. In the experimental results, the highest correlation coefficient, 0.962, is obtained with the weighted spectral slope (WSS) measure.
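
Measuring this correlation is a one-liner once per-condition scores are collected. The sketch below uses made-up placeholder numbers purely to show the computation; the paper's actual data are not reproduced here:

```python
import numpy as np

# Hypothetical per-condition scores: an objective quality measure (e.g. WSS)
# and the word accuracy an ASR system achieved under the same conditions.
wss_scores = np.array([35.2, 48.9, 60.1, 72.4, 85.0])   # made-up values
word_acc   = np.array([92.1, 85.4, 74.3, 61.0, 48.7])   # made-up values

# Pearson correlation between the measure and recognition accuracy;
# |r| close to 1 means the measure predicts ASR performance well.
r = np.corrcoef(wss_scores, word_acc)[0, 1]
print(f"correlation: {r:.3f}")
```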

#13 Effect of compressing the dynamic range of the power spectrum in modulation filtering based speech enhancement

Authors: James G. Lyons ; Kuldip K. Paliwal

In the modulation-filtering based speech enhancement method, noise suppression is achieved by bandpass filtering the temporal trajectories of the power spectrum. In the literature, some authors use the power spectrum directly for modulation filtering, while others apply different compression functions to reduce the dynamic range of the power spectrum prior to modulation filtering. This paper systematically compares different dynamic-range compression functions applied to the power spectrum for speech enhancement. Subjective listening tests and objective measures are used to evaluate the quality as well as the intelligibility of the enhanced speech. Quality is measured objectively in terms of the Perceptual Evaluation of Speech Quality (PESQ) measure and intelligibility in terms of the Speech Transmission Index (STI) measure. It is found that P^0.3333 (the power spectrum raised to the power 1/3) results in the highest speech quality and intelligibility.
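
A compact sketch of the processing this describes: compress the short-time power spectrum, then bandpass-filter each frequency bin's trajectory across frames. The cube-root compression is the paper's best-performing choice; the modulation band edges and filter order are typical values we assume, not taken from the paper:

```python
import numpy as np
from scipy.signal import butter, filtfilt

def modulation_filter(power_spec, fs_frame,
                      compress=lambda p: p ** (1.0 / 3.0),
                      band=(1.0, 16.0)):
    """Bandpass-filter compressed power-spectrum trajectories (sketch).

    power_spec : (n_frames, n_bins) short-time power spectrum
    fs_frame   : frame rate in Hz (e.g. 100 for a 10 ms hop)
    compress   : dynamic-range compression; cube root realizes P^(1/3)
    """
    c = compress(np.maximum(power_spec, 1e-12))
    nyq = fs_frame / 2.0
    b, a = butter(2, [band[0] / nyq, band[1] / nyq], btype='band')
    # zero-phase filter each frequency bin's temporal trajectory
    return filtfilt(b, a, c, axis=0)
```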

#14 A long state vector Kalman filter for speech enhancement

Authors: Stephen So ; Kuldip K. Paliwal

In this paper, we investigate a long state vector Kalman filter for the enhancement of speech corrupted by white and coloured noise. Previous studies have reported that a vector Kalman filter achieves better enhancement than the scalar Kalman filter, and it is expected that increasing the state vector length may improve enhancement performance even further. However, any improvement that may result from an increase in state vector length is constrained by the typical use of short, non-overlapped speech frames, as the autocorrelation coefficient estimates tend to become less reliable at higher lags. We propose to overcome this problem by incorporating an analysis-modification-synthesis framework in which long, overlapped frames are used instead. Our enhancement experiments on the NOIZEUS corpus show that the proposed long state vector Kalman filter achieves higher mean SNR and PESQ scores than the scalar and short state vector Kalman filters, confirming the notion that a longer state vector can lead to better enhancement.
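
For context, a minimal Kalman filter for an AR(p) speech model in white observation noise looks as follows; the state here holds only the last p samples (the short state vector the paper improves on), and the AR coefficients and noise variances are assumed given rather than estimated:

```python
import numpy as np

def ar_kalman_enhance(y, a, q, r):
    """Kalman filtering of noisy speech under an AR(p) model (sketch).

    y : noisy samples;  a : AR coefficients (s_t = sum_i a_i s_{t-i} + w_t)
    q : driving-noise variance;  r : additive observation-noise variance
    """
    p = len(a)
    F = np.zeros((p, p)); F[0, :] = a; F[1:, :-1] = np.eye(p - 1)
    H = np.zeros(p); H[0] = 1.0          # we observe the newest sample
    x = np.zeros(p); P = np.eye(p)
    out = np.empty(len(y))
    for t, yt in enumerate(y):
        x = F @ x                        # predict state
        P = F @ P @ F.T
        P[0, 0] += q                     # process noise enters newest sample
        k = P @ H / (H @ P @ H + r)      # Kalman gain
        x = x + k * (yt - H @ x)         # measurement update
        P = P - np.outer(k, H @ P)
        out[t] = x[0]                    # clean-sample estimate
    return out
```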

#15 Subspace based speech enhancement using Gaussian mixture model

Authors: Achintya Kundu ; Saikat Chatterjee ; T. V. Sreenivas

Traditional subspace-based speech enhancement (SSE) methods use linear minimum mean square error (LMMSE) estimation, which is optimal if the Karhunen-Loève transform (KLT) coefficients of speech and noise are Gaussian distributed. In this paper, we investigate the use of a Gaussian mixture (GM) density for modeling the non-Gaussian statistics of the clean speech KLT coefficients. Using a Gaussian mixture model (GMM), the optimum minimum mean square error (MMSE) estimator is found to be nonlinear, and the traditional LMMSE estimator is shown to be a special case. Experimental results show that the proposed method provides better enhancement performance than traditional subspace-based methods.
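
The resulting estimator has a convenient closed form: a posterior-weighted sum of per-component LMMSE estimates, which reduces to the traditional LMMSE estimator when K = 1. A sketch with diagonal covariances (dimensioning and variable names are ours):

```python
import numpy as np

def gmm_mmse(y, weights, means, variances, noise_var):
    """MMSE estimate of clean KLT coefficients under a GMM prior (sketch).

    y         : observed noisy coefficient vector, shape (d,)
    weights   : (K,) mixture weights
    means, variances : (K, d) diagonal-covariance GMM of clean speech
    noise_var : (d,) Gaussian noise variance per coefficient
    """
    tot = variances + noise_var                   # (K, d) marginal variances
    # log p(y, k) for each Gaussian component
    loglik = -0.5 * (np.log(2 * np.pi * tot) + (y - means) ** 2 / tot).sum(-1)
    logp = np.log(weights) + loglik               # (K,)
    post = np.exp(logp - logp.max())
    post /= post.sum()                            # responsibilities p(k | y)
    # per-component LMMSE: mu_k + var_k / (var_k + noise) * (y - mu_k)
    est_k = means + variances / tot * (y - means)
    return (post[:, None] * est_k).sum(axis=0)    # posterior-weighted sum
```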

#16 Generalized parametric spectral subtraction using weighted Euclidean distortion

Authors: Amit Das ; John H. L. Hansen

An improved version of the original parametric formulation of the generalized spectral subtraction method is presented in this study. The original formulation uses parameters that minimize the mean-square error (MSE) between the estimated and true speech spectral amplitudes. However, the MSE does not take into account any perceptual measure. We propose two new short-time spectral amplitude estimators based on a perceptual error criterion: the weighted Euclidean distortion. The error function is easily adaptable to penalize spectral peaks and valleys differently. Performance evaluations were carried out using two noise types over four SNR levels and compared to the original parametric formulation. Results demonstrate that in most cases the proposed estimators achieve greater noise suppression without introducing speech distortion.

#17 Sudden noise reduction based on GMM with noise power estimation

Authors: Nobuyuki Miyake ; Tetsuya Takiguchi ; Yasuo Ariki

This paper describes a method for reducing sudden noise using noise detection and classification methods together with noise power estimation. Sudden noise detection and classification were dealt with in our previous study. In this paper, noise classification is improved to handle more kinds of noise based on k-means clustering, and GMM-based noise reduction is performed using the detection and classification results. The classification determines which kind of noise we are dealing with, but its power is unknown. This problem is solved by combining an estimate of the noise power with the noise reduction method. In our experiments, the proposed method achieved good performance in recognizing utterances overlapped by sudden noises.

#18 Speech enhancement using a Wiener denoising technique and musical noise reduction

Authors: Md. Jahangir Alam ; Sid-Ahmed Selouani ; Douglas O'Shaughnessy ; Sofia Ben Jebara

Speech enhancement methods using spectral subtraction have the drawback of generating an annoying residual noise with a musical character. In this paper, a frequency-domain optimal linear estimator with perceptual postfiltering is proposed, which incorporates the masking properties of the human auditory system to render the residual noise distortion inaudible. The performance of the proposed enhancement algorithm is evaluated with the segmental SNR, Log Spectral Distance (LSD), and Perceptual Evaluation of Speech Quality (PESQ) measures under various noisy environments, yielding better results than the Wiener denoising technique.

#19 Regularized non-negative matrix factorization with temporal dependencies for speech denoising

Authors: Kevin W. Wilson ; Bhiksha Raj ; Paris Smaragdis

We present a technique for denoising speech using temporally regularized nonnegative matrix factorization (NMF). In previous work [1], we used a regularized NMF update to impose structure within each audio frame. In this paper, we add frame-to-frame regularization across time and show that this additional regularization can also improve our speech denoising results. We evaluate our algorithm on a range of nonstationary noise types and outperform a state-of-the-art Wiener filter implementation.
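
A hedged sketch of supervised NMF denoising with one simple frame-to-frame smoothness penalty folded into the multiplicative update; this illustrates the idea of regularizing activations across time, not necessarily the authors' exact regularizer from [1]:

```python
import numpy as np

def nmf_denoise(V, W_speech, W_noise, n_iter=100, lam=0.1):
    """Supervised NMF denoising with a temporal smoothness term (sketch).

    V : (n_bins, n_frames) noisy magnitude spectrogram
    W_speech, W_noise : pre-trained basis matrices, held fixed here
    lam : weight pulling each activation column toward its neighbours
          (np.roll wraps at the edges; a simplification for brevity)
    """
    W = np.hstack([W_speech, W_noise])
    H = np.random.rand(W.shape[1], V.shape[1]) + 0.1
    for _ in range(n_iter):
        # Euclidean-NMF multiplicative update with a smoothness penalty
        neigh = (np.roll(H, 1, axis=1) + np.roll(H, -1, axis=1)) / 2.0
        H *= (W.T @ V + lam * neigh) / (W.T @ (W @ H) + lam * H + 1e-12)
    ks = W_speech.shape[1]
    S = W_speech @ H[:ks]                 # speech part of the reconstruction
    N = W_noise @ H[ks:]                  # noise part
    mask = S / np.maximum(S + N, 1e-12)   # Wiener-style soft mask
    return mask * V
```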

#20 ICA-based MAP speech enhancement with multiple variable speech distribution models

Authors: Xin Zou ; Peter Jančovič ; Munevver Kokuer ; Martin J. Russell

This paper proposes a novel ICA-based MAP speech enhancement algorithm using multiple variable speech distribution models. The proposed algorithm consists of two stages, primary and advanced enhancement. The primary enhancement is performed by employing a single distribution model obtained from all speech signals. The advanced enhancement first employs multiple models of speech signals, each modeling a specific type of speech, and then adapts these model parameters for each speech frame using the enhanced signal from the primary estimation. A statistical noise adaptation technique is employed to better model the noise in the non-stationary case. The proposed algorithm has been evaluated on speech from the TIMIT database corrupted by various noises and shows significantly improved performance over using a single speech distribution model.

#21 Source separation based on binaural cues and source model constraints

Authors: Ron J. Weiss ; Michael I. Mandel ; Daniel P. W. Ellis

We describe a system for separating multiple sources from a two-channel recording based on interaural cues and known characteristics of the source signals. We combine a probabilistic model of the observed interaural level and phase differences with a prior model of the source statistics and derive an EM algorithm for finding the maximum likelihood parameters of the joint model. The system is able to separate more sound sources than there are observed channels. In simulated reverberant mixtures of three speakers the proposed algorithm gives a signal-to-noise ratio improvement of 2.1 dB over a baseline algorithm using only interaural cues.

#22 Maximum kurtosis beamforming with the generalized sidelobe canceller

Authors: Kenichi Kumatani ; John McDonough ; Barbara Rauch ; Philip N. Garner ; Weifeng Li ; John Dines

This paper presents an adaptive beamforming application based on the capture of far-field speech data from a single speaker in a real meeting room. After the position of the speaker is estimated by a speaker tracking system, we construct a subband-domain beamformer in generalized sidelobe canceller (GSC) configuration. In contrast to conventional practice, we then optimize the active weight vectors of the GSC so that the distribution of the output signal is as non-Gaussian as possible, using kurtosis to measure the degree of non-Gaussianity. Our beamforming algorithm can suppress noise and reverberation without the signal cancellation problems encountered in conventional beamforming algorithms. We demonstrate the effectiveness of the proposed techniques through a series of far-field automatic speech recognition experiments on the Multi-Channel Wall Street Journal Audio Visual Corpus (MC-WSJ-AV). The beamforming algorithm proposed here achieved a 13.6% WER, whereas a simple delay-and-sum beamformer yielded a WER of 17.8%.
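
The optimization criterion is easy to state concretely: the empirical excess kurtosis of the beamformer output, which is zero for circular complex Gaussian noise and grows as the output becomes more speech-like. A minimal sketch of the objective (the normalization convention is the usual one for complex signals, assumed rather than quoted from the paper):

```python
import numpy as np

def excess_kurtosis(y):
    """Empirical excess kurtosis of a (complex) subband output.

    For a circular complex Gaussian, E|y|^4 / (E|y|^2)^2 = 2, so this
    returns roughly zero for Gaussian noise and positive values for
    super-Gaussian (speech-like) outputs; maximizing it over the GSC
    active weights is the criterion described above.
    """
    p2 = np.mean(np.abs(y) ** 2)
    p4 = np.mean(np.abs(y) ** 4)
    return p4 / (p2 ** 2 + 1e-12) - 2.0
```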

#23 Noise robust speech dereverberation using constrained inverse filter

Authors: Ken'ichi Furuya ; Akitoshi Kataoka ; Yoichi Haneda

A noise-robust dereverberation method is presented for speech enhancement in noisy reverberant conditions. The method introduces a constraint that minimizes the noise power in the inverse filter computation for dereverberation. It is shown that there exists a tradeoff between reducing the reverberation and reducing the noise; this tradeoff can be controlled by the constraint. Inverse filtering reduces early reflections and directional noise. In addition, spectral subtraction is used to suppress the tail of the inverse-filtered reverberation and residual noise. The performance of our method is objectively and subjectively evaluated in experiments using measured room impulse responses. The results indicate that this method provides better speech quality than conventional methods.

#24 A dual microphone coherence based method for speech enhancement in headsets

Authors: Mohsen Rahmani ; Ahmad Akbari ; Beghdad Ayad

The performance of two-microphone coherence-based methods degrades if the two captured noises are correlated. Cross Power Spectrum Subtraction (CPSS) is a coherence method adapted to correlated-noise environments. In this paper, we propose a new technique for estimating the speech cross-power spectral density and exploit it in CPSS. The proposed speech enhancement method is evaluated both as a speech recognition preprocessing system and as an independent speech enhancement system. The enhancement results show the practical superiority of the proposed method compared with previous solutions.

#25 Sound capture system and spatial filter for small devices

Authors: Ivan Tashev ; Slavy Mihov ; Tyler Gleghorn ; Alex Acero

The use of cellular phones and small form factor devices such as PDAs and other handhelds has been increasing rapidly. Their use is varied, with scenarios such as communication, internet browsing, and audio and video recording, to name a few. This requires a better sound capture system, as the sound source is at a larger distance from the device's microphone. In this paper we propose a sound capture system for small devices which uses two unidirectional microphones placed back-to-back close to each other. The processing chain consists of a beamformer and a non-linear spatial filter. The speech enhancement processing achieves an improvement of 0.39 MOS points in perceptual sound quality and a 10.8 dB improvement in SNR.